1 Extracting transportation safety data from online sources

1.2 Traffic flow data

1.2.1 Historical data (yearly)

The Federal Highway Administration (FHWA) provides Annual Average Daily Traffic (AADT) data for 2011 through 2017. As an illustration, the table below shows the first five observations of the AADT data for Missouri in 2017.

Historical traffic data in Missouri, 2017
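| Year_Recor | State_Code | Route_ID | Begin_Poin | End_Point | Route_Numb | Route_Name | Route_Qual | Route_Sign | F_System | Facility_T | Urban_Code | County_Cod | Ownership | NHS | STRAHNET | Truck_NN | Through_La | Speed_Limi | Access_Con | AADT | AADT_Singl | AADT_Combi | IRI | PSR | Toll_Charg | Toll_ID | Toll_Type | ESRI_OID | Shape_Leng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | 29 | 52 | 107.4 | 107.5 | 52 | 52 | 1 | 4 | 4 | 2 | 99999 | 15 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 4075 | 0 | 0 | 82 | 0 | 0 | NA | 0 | 1 | 0.0017938 |
| 2017 | 29 | 1985 | 256.1 | 256.2 | 54 | 54 | 1 | 3 | 3 | 2 | 99999 | 163 | 1 | 1 | 0 | 1 | 2 | 0 | 3 | 4389 | 164 | 934 | 50 | 0 | 0 | NA | 0 | 2 | 0.0018675 |
| 2017 | 29 | 116562 | 6.5 | 6.6 | 0 | STATE LINE RD | 1 | 6 | 4 | 2 | 43912 | 95 | 2 | 0 | 0 | 0 | 4 | 40 | 3 | 27711 | 544 | 918 | 400 | 0 | 0 | NA | 0 | 3 | 0.0014497 |
| 2017 | 29 | 4442 | 2.4 | 2.5 | 0 | H | 1 | 4 | 5 | 2 | 99999 | 195 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 690 | 0 | 0 | 138 | 0 | 0 | NA | 0 | 4 | 0.0018565 |
| 2017 | 29 | 2809 | 10.3 | 10.4 | 0 | Z | 1 | 4 | 5 | 2 | 99999 | 67 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 498 | 0 | 0 | 140 | 0 | 0 | NA | 0 | 5 | 0.0014515 |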

The downloaded “shape files” can be converted to different data formats (e.g., .csv) using the following R code.
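The conversion can be sketched as follows. This is a minimal sketch and not necessarily the authors' original code: it reads only the attribute table, which a shapefile stores in its .dbf component, using read.dbf() from the foreign package (shipped with standard R distributions), and it drops the geometry. The function name and the file names are hypothetical.

```r
library(foreign)  # provides read.dbf() for dBase (.dbf) attribute tables

# Convert the attribute table of a shapefile to a CSV file.
#   dbf_path: path to the .dbf component of the downloaded shapefile
#   csv_path: path of the CSV file to create
shapefile_attributes_to_csv <- function(dbf_path, csv_path) {
  attrs <- read.dbf(dbf_path, as.is = TRUE)  # keep text columns as character
  write.csv(attrs, csv_path, row.names = FALSE)
  invisible(attrs)
}

# Usage (hypothetical file names):
# aadt <- shapefile_attributes_to_csv("missouri2017.dbf", "missouri2017.csv")
# head(aadt, 5)
```

Field names in a .dbf file are limited to 10 characters, which is consistent with truncated column names such as Year_Recor and Begin_Poin in the table above.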

1.2.2 Real-time data (<= 5 minutes)

There are several sources of real-time traffic data. Some states in the USA are equipped with loop detectors and video cameras, and their Departments of Transportation (DoTs) can provide this data. In addition, the HERE website provides near real-time traffic data, free of charge up to a limit of x API requests per day. Detailed instructions for getting this data from the HERE website are provided later.

  • State DoTs|loop detector
  • State DoTs|video frames

HERE Website

HERE|Phone based

1.3 Weather data

1.3.1 Historical (daily)

NOAA

1.3.2 Real-time (<= 1 hour)

In this part, we show how to get both historical and real-time weather data using the DarkSky API, which can be used from both Python and R. Before using the DarkSky API to get weather data, you need to register for an API key on its official website. The first 1,000 API requests you make each day are free; each request beyond that daily limit costs $0.0001, which means that one million extra API requests will cost you 100 USD.
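The pricing rule above can be checked with a few lines of R. This is only an illustrative sketch: the function name and its arguments are hypothetical, and the default values are the figures quoted in the text.

```r
# Daily cost in USD under the pricing described above: the first
# `free_requests` calls are free, each further call costs `price_per_request`.
darksky_daily_cost <- function(n_requests,
                               free_requests = 1000,
                               price_per_request = 0.0001) {
  billable <- pmax(n_requests - free_requests, 0)  # requests beyond the free limit
  billable * price_per_request
}

darksky_daily_cost(800)          # within the free limit: 0
darksky_daily_cost(1000 + 1e6)   # one million extra requests: 100
```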

To get weather data from the DarkSky API, you need to provide the following information for each truck observation:

  1. latitude
  2. longitude
  3. date and time

You can then pass these three parameters to the get_forecast_for() function in the darksky package in R.
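A minimal sketch of such a call is shown below. The helper names format_timestamp() and fetch_weather() are hypothetical, as are the coordinates in the usage comment. The sketch assumes that get_forecast_for() accepts an ISO 8601 string as its timestamp and that the package picks up the API key from the DARKSKY_API_KEY environment variable; both should be checked against the package documentation. The request itself needs a valid key and network access, so it is left commented out.

```r
# Format a date-time as an ISO 8601 string such as "2016-01-18T01:22:18".
format_timestamp <- function(x, tz = "UTC") {
  format(as.POSIXct(x, tz = tz), "%Y-%m-%dT%H:%M:%S")
}

# Thin wrapper around darksky::get_forecast_for() (hypothetical helper).
fetch_weather <- function(latitude, longitude, datetime) {
  if (!requireNamespace("darksky", quietly = TRUE)) {
    stop("The darksky package is required for this call.")
  }
  darksky::get_forecast_for(latitude, longitude, format_timestamp(datetime))
}

format_timestamp("2016-01-18 01:22:18")
# "2016-01-18T01:22:18"

# Usage (hypothetical coordinates; requires an API key and network access):
# weather <- fetch_weather(38.63, -90.20, "2016-01-18 01:22:18")
# names(weather)
```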

1.4 DarkSky returned data

For each request, the data returned by the darksky package is a list of three data frames:

  1. Hourly weather data: 24 hourly observations of 15 weather variables for that day.
  2. Daily weather data: 1 observation of 34 weather variables for that day.
  3. Current weather data: 1 observation of 15 weather variables at the requested time point.

The variables include: apparent (feels-like) temperature, atmospheric pressure, dew point, humidity, liquid precipitation rate, moon phase, nearest storm distance, nearest storm direction, ozone, precipitation type, snowfall, sunrise and sunset times, temperature, text summaries, UV index, wind gust, wind speed, and wind direction.

1.4.1 Hourly weather

| time | summary | icon | precipIntensity | precipProbability | temperature | apparentTemperature | dewPoint | humidity | pressure | windSpeed | windGust | windBearing | cloudCover | uvIndex | visibility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-17 23:00:00 | Overcast | cloudy | 0 | 0 | 27.27 | 27.27 | 24.11 | 0.88 | 1004.61 | 0.77 | 2.59 | 219 | 1.00 | 0 | 10.00 |
| 2016-01-18 00:00:00 | Mostly Cloudy | partly-cloudy-night | 0 | 0 | 26.68 | 26.68 | 23.50 | 0.88 | 1003.95 | 0.81 | 2.66 | 249 | 0.79 | 0 | 10.00 |
| 2016-01-18 01:00:00 | Partly Cloudy | partly-cloudy-night | 0 | 0 | 26.19 | 26.19 | 23.01 | 0.88 | 1004.09 | 1.03 | 4.04 | 265 | 0.49 | 0 | 10.00 |
| 2016-01-18 02:00:00 | Mostly Cloudy | partly-cloudy-night | 0 | 0 | 26.35 | 26.35 | 22.84 | 0.86 | 1004.28 | 2.71 | 7.68 | 264 | 0.64 | 0 | 10.00 |
| 2016-01-18 03:00:00 | Mostly Cloudy | partly-cloudy-night | 0 | 0 | 26.77 | 22.65 | 19.54 | 0.74 | 1004.48 | 3.52 | 12.21 | 290 | 0.75 | 0 | 10.00 |
| 2016-01-18 04:00:00 | Overcast | cloudy | 0 | 0 | 24.82 | 17.83 | 15.50 | 0.67 | 1005.71 | 5.84 | 15.04 | 301 | 1.00 | 0 | 4.55 |
| 2016-01-18 05:00:00 | Overcast | cloudy | 0 | 0 | 20.40 | 12.77 | 11.52 | 0.68 | 1007.11 | 5.64 | 17.41 | 326 | 1.00 | 0 | 8.64 |
| 2016-01-18 06:00:00 | Clear | clear-night | 0 | 0 | 19.03 | 13.49 | 8.08 | 0.62 | 1008.00 | 3.72 | 14.77 | 303 | 0.24 | 0 | 10.00 |
| 2016-01-18 07:00:00 | Clear | clear-day | 0 | 0 | 18.63 | 11.50 | 6.19 | 0.58 | 1008.91 | 4.89 | 15.13 | 311 | 0.00 | 0 | 10.00 |
| 2016-01-18 08:00:00 | Clear | clear-day | 0 | 0 | 19.49 | 11.76 | 6.36 | 0.56 | 1009.32 | 5.57 | 17.34 | 300 | 0.20 | 0 | 10.00 |
| 2016-01-18 09:00:00 | Clear | clear-day | 0 | 0 | 21.14 | 14.00 | 6.38 | 0.52 | 1009.47 | 5.30 | 15.53 | 315 | 0.08 | 1 | 10.00 |
| 2016-01-18 10:00:00 | Clear | clear-day | 0 | 0 | 22.78 | 15.32 | 5.86 | 0.48 | 1009.71 | 5.92 | 16.88 | 309 | 0.12 | 2 | 10.00 |
| 2016-01-18 11:00:00 | Partly Cloudy | partly-cloudy-day | 0 | 0 | 24.11 | 17.41 | 5.23 | 0.44 | 1009.12 | 5.39 | 16.75 | 318 | 0.30 | 2 | 10.00 |
| 2016-01-18 12:00:00 | Partly Cloudy | partly-cloudy-day | 0 | 0 | 24.70 | 16.89 | 2.93 | 0.38 | 1008.97 | 6.77 | 20.25 | 308 | 0.44 | 2 | 10.00 |
| 2016-01-18 13:00:00 | Partly Cloudy | partly-cloudy-day | 0 | 0 | 24.14 | 16.78 | 2.67 | 0.39 | 1008.92 | 6.09 | 19.41 | 313 | 0.59 | 1 | 10.00 |
| 2016-01-18 14:00:00 | Mostly Cloudy | partly-cloudy-day | 0 | 0 | 23.15 | 15.64 | 1.59 | 0.38 | 1009.35 | 6.06 | 18.83 | 297 | 0.75 | 1 | 10.00 |
| 2016-01-18 15:00:00 | Partly Cloudy | partly-cloudy-day | 0 | 0 | 22.16 | 14.88 | 2.69 | 0.42 | 1010.17 | 5.61 | 18.33 | 299 | 0.44 | 0 | 10.00 |
| 2016-01-18 16:00:00 | Partly Cloudy | partly-cloudy-night | 0 | 0 | 20.40 | 14.37 | 4.03 | 0.48 | 1010.93 | 4.24 | 16.11 | 280 | 0.41 | 0 | 10.00 |
| 2016-01-18 17:00:00 | Mostly Cloudy | partly-cloudy-night | 0 | 0 | 19.08 | 12.65 | 5.86 | 0.56 | 1011.85 | 4.39 | 15.06 | 295 | 0.61 | 0 | 10.00 |
| 2016-01-18 18:00:00 | Clear | clear-night | 0 | 0 | 18.57 | 11.49 | 4.81 | 0.54 | 1012.38 | 4.84 | 15.47 | 292 | 0.08 | 0 | 10.00 |
| 2016-01-18 19:00:00 | Clear | clear-night | 0 | 0 | 17.78 | 11.27 | 3.39 | 0.53 | 1012.91 | 4.28 | 15.96 | 284 | 0.04 | 0 | 10.00 |
| 2016-01-18 20:00:00 | Clear | clear-night | 0 | 0 | 17.38 | 10.38 | 3.60 | 0.54 | 1012.91 | 4.61 | 14.81 | 281 | 0.10 | 0 | 10.00 |
| 2016-01-18 21:00:00 | Clear | clear-night | 0 | 0 | 16.84 | 11.01 | 3.70 | 0.56 | 1013.27 | 3.70 | 13.80 | 283 | 0.20 | 0 | 10.00 |
| 2016-01-18 22:00:00 | Clear | clear-night | 0 | 0 | 16.47 | 9.29 | 3.82 | 0.57 | 1013.41 | 4.63 | 13.58 | 287 | 0.17 | 0 | 10.00 |

1.4.2 Daily weather

The single daily observation is shown with one variable per row.

| Variable | Value |
|---|---|
| time | 2016-01-17 23:00:00 |
| summary | Partly cloudy starting in the afternoon, continuing until evening. |
| icon | partly-cloudy-day |
| sunriseTime | 2016-01-18 06:20:37 |
| sunsetTime | 2016-01-18 15:56:30 |
| moonPhase | 0.32 |
| precipIntensity | 0 |
| precipIntensityMax | 0 |
| precipProbability | 0 |
| temperatureHigh | 24.7 |
| temperatureHighTime | 1453140000 |
| temperatureLow | 14.1 |
| temperatureLowTime | 1453204800 |
| apparentTemperatureHigh | 17.41 |
| apparentTemperatureHighTime | 1453136400 |
| apparentTemperatureLow | 6.4 |
| apparentTemperatureLowTime | 1453204800 |
| dewPoint | 9.05 |
| humidity | 0.59 |
| pressure | 1008.91 |
| windSpeed | 4.26 |
| windGust | 20.25 |
| windGustTime | 1453140000 |
| windBearing | 299 |
| cloudCover | 0.43 |
| uvIndex | 2 |
| uvIndexTime | 1453132800 |
| visibility | 9.71 |
| temperatureMin | 16.47 |
| temperatureMinTime | 2016-01-18 22:00:00 |
| temperatureMax | 27.27 |
| temperatureMaxTime | 2016-01-17 23:00:00 |
| apparentTemperatureMin | 9.29 |
| apparentTemperatureMinTime | 2016-01-18 22:00:00 |
| apparentTemperatureMax | 27.27 |
| apparentTemperatureMaxTime | 2016-01-17 23:00:00 |
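Several of the daily time fields, such as temperatureHighTime and windGustTime, are returned as raw Unix epoch seconds rather than formatted date-times. A minimal sketch of converting them in base R follows; the helper name epoch_to_datetime() is hypothetical.

```r
# Convert Unix epoch seconds (seconds since 1970-01-01 00:00:00 UTC)
# to a POSIXct date-time in the requested time zone.
epoch_to_datetime <- function(seconds, tz = "UTC") {
  as.POSIXct(seconds, origin = "1970-01-01", tz = tz)
}

# temperatureHighTime from the daily record above
format(epoch_to_datetime(1453140000), "%Y-%m-%d %H:%M:%S")
# "2016-01-18 18:00:00" (UTC)
```

Converted with tz = "America/Chicago" instead, the same value reads 2016-01-18 12:00:00, the hour at which the hourly table reports a temperature of 24.70, the value of temperatureHigh.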

1.4.3 Current weather

| time | summary | icon | precipIntensity | precipProbability | temperature | apparentTemperature | dewPoint | humidity | pressure | windSpeed | windGust | windBearing | cloudCover | uvIndex | visibility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-18 01:22:18 | Partly Cloudy | partly-cloudy-night | 0 | 0 | 26.25 | 26.25 | 22.95 | 0.87 | 1004.16 | 1.65 | 5.39 | 265 | 0.54 | 0 | 10 |

2 Descriptive analytic tools used for understanding transportation safety data

2.1 An example of clustering

The following code attempts to replicate the visual clustering approach of

Van Wijk, Jarke J., and Edward R. Van Selow. 1999. “Cluster and Calendar Based Visualization of Time Series Data.” In Proceedings of the 1999 IEEE Symposium on Information Visualization (InfoVis ’99), 4-9. IEEE.

A brief example of applying exploratory data analysis (EDA) methods to traffic data is provided here. The goal of this example is to illustrate how useful these tools are in the transportation context. There is no predetermined way to use these methods; how well each one works depends heavily on the nature of the problem. Hence, the challenge is to choose the tool that fits the problem best.

2.1.1 Collecting Data

This example uses hourly vehicle count data, that is, the number of vehicles that passed along a particular road segment in each hour. The data were extracted from the Georgia Department of Transportation (GDoT) (Georgia Department of Transportation, 2015) for 2015, from station 121-5505, which is located in Atlanta. GDoT provides the data in separate sheets for each month. After extraction and cleaning, the data were combined into one sheet with 365 rows (days) and 24 columns (hours). The data can be downloaded from GDoT.

Clustering

It is almost impossible to understand the raw data, or to discover interesting patterns in it, just by looking at 8760 (365 * 24) data cells. Hence, the K-means clustering method is used here to present the data in a more understandable format. K-means clustering is a common technique for exploring data and discovering patterns by grouping similar observations into a predefined number (k) of clusters. It assigns the observations to k clusters so as to minimize the within-cluster sum of squares (WCSS). To find the optimal number of clusters, we used the method suggested by Pham et al. (2005). According to the following graph, two is the best number of clusters for this data.

## kmeans(): generating initial means
## kmeans(): n_threads: 8
## kmeans(): iteration:    1   delta: 8.45358e+06
## kmeans(): iteration:    2   delta: 22935.6
## kmeans(): iteration:    3   delta: 1864.51
## kmeans(): iteration:    4   delta: 0
## 
##   1   2 
## 119 246
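A minimal sketch of this clustering step is given below. It is not the authors' code (the log above appears to come from a different k-means implementation); it uses base R's kmeans() on synthetic data, because the GDoT counts are not reproduced here. The traffic profiles, the noise level, and all object names are hypothetical, and the number of clusters is simply set to two as stated in the text rather than selected with the method of Pham et al. (2005).

```r
set.seed(1)
hours <- 0:23
dates <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)  # 0 = Sunday, 6 = Saturday

# Hypothetical hourly profiles: two commuting peaks on weekdays,
# one broad midday peak on weekends.
bump <- function(center, height, width) {
  height * exp(-(hours - center)^2 / (2 * width^2))
}
weekday_profile <- 1000 + bump(7, 5000, 2) + bump(16, 5500, 2)
weekend_profile <- 1000 + bump(13, 3500, 4)

# 365 x 24 matrix: one row per day, one column per hour, plus noise.
counts <- t(sapply(is_weekend, function(w) {
  base <- if (w) weekend_profile else weekday_profile
  base + rnorm(24, sd = 150)
}))

# K-means with k = 2; nstart repeats the random initialisation and
# keeps the solution with the smallest total WCSS.
fit <- kmeans(counts, centers = 2, nstart = 10)

table(cluster = fit$cluster, weekend = is_weekend)  # cluster sizes by day type
fit$tot.withinss                                    # total WCSS of the solution
```

Here fit$cluster plays the role of the cluster column described in the text: one label (1 or 2) per day.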

2.1.2 Visualization

Now, k-means clustering can be applied. The output of this step is a column whose value is either one or two, indicating whether each row of data (day) belongs to cluster one or cluster two. The data is now divided into two groups. However, we still need to transfer the data to a visual format in order to validate and guide the clustering process. Since our data contains temporal information, we used the Cluster Calendar View visualization technique introduced by Van Wijk and Van Selow (1999). In this technique, a calendar represents the temporal information of the data, and color coding distinguishes the clusters. The following graph shows a cluster calendar view for our data. It clearly reveals meaningful patterns in the vehicle count data: weekends and weekdays have different traffic patterns. It has also captured some of the holidays. For example, the Independence Day holiday is colored light blue, which means that its traffic pattern is similar to that of a weekend. (In 2015, July 4 itself fell on a Saturday, and the federal holiday was observed on Friday, July 3.) In addition, the clustering method has identified other holidays such as Martin Luther King Day, Memorial Day, Labor Day, Thanksgiving Day, and Christmas Day.
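The layout behind a cluster calendar view can be sketched in base R as follows. This is only an illustration of the idea, not the authors' plotting code: each day is given a position in its month grid (a weekday column and a week row), and the cluster label then decides the cell color. The data frame and its column names are hypothetical, and the cluster column is filled with a simple weekend/weekday placeholder instead of real k-means output.

```r
dates <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
lt <- as.POSIXlt(dates)

# Weekday (0 = Sunday, ..., 6 = Saturday) of the first day of each date's month
first_of_month <- as.Date(format(dates, "%Y-%m-01"))
first_wday <- as.POSIXlt(first_of_month)$wday

calendar <- data.frame(
  date  = dates,
  month = lt$mon + 1,                           # 1..12, one panel per month
  wday  = lt$wday,                              # column within the month grid
  week  = (lt$mday - 1 + first_wday) %/% 7 + 1  # row within the month grid
)

# Placeholder labels: weekends as cluster 1, all other days as cluster 2.
# In practice this column would hold the k-means assignment of each day.
calendar$cluster <- ifelse(calendar$wday %in% c(0, 6), 1L, 2L)

# First calendar row of July 2015: Wednesday, July 1 to Saturday, July 4
subset(calendar, month == 7 & week == 1)
```

Drawing one colored cell per row of this data frame, at position (wday, week) within the panel for its month, yields the calendar view.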

Furthermore, a line chart (the following graph) shows the average hourly traffic for the two clusters. The results show that each cluster has different peaks and valleys. On weekdays, 7 AM and 4 PM have the greatest number of vehicles, which can be explained by official working hours. On weekends, by contrast, traffic peaks around 1 PM, which may reflect people going out for lunch.
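The per-cluster averages behind such a line chart can be computed as sketched below. The sketch uses a tiny toy dataset with hypothetical values (four days, two per cluster) instead of the GDoT counts, and all object names are illustrative.

```r
hours <- 0:23

# Toy data (hypothetical values): 4 days x 24 hours, two days per cluster.
weekday_like <- 1000 + 4000 * exp(-(hours - 7)^2 / 8) +
  4500 * exp(-(hours - 16)^2 / 8)
weekend_like <- 1000 + 3000 * exp(-(hours - 13)^2 / 32)
counts  <- rbind(weekday_like, weekday_like * 1.1,
                 weekend_like, weekend_like * 0.9)
cluster <- c(1, 1, 2, 2)

# Average hourly count per cluster: one row per cluster, one column per hour.
profiles <- rowsum(counts, cluster) / as.vector(table(cluster))
colnames(profiles) <- hours

# Hour of day with the highest average count in each cluster
peak_hour <- hours[apply(profiles, 1, which.max)]
peak_hour

# Line chart of the two average profiles (drawn only in interactive sessions)
if (interactive()) {
  matplot(hours, t(profiles), type = "l", lty = 1,
          xlab = "Hour of day", ylab = "Average vehicle count")
}
```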

To sum up, the K-means clustering method appears to have worked well here. Using only the raw data as input, we discovered distinct weekday and weekend traffic patterns, and with the help of the visualization technique we obtained a considerable amount of information about the data.


  1. Department of Epidemiology and Biostatistics, Saint Louis University. Email address miao.cai@slu.edu

  2. Department of Industrial and Systems Engineering, Auburn University. Email address azm0127@auburn.edu

  3. Carey Business School, Johns Hopkins University. Email address mza0052@auburn.edu

  4. Farmer School of Business, Miami University. Email address fmegahed@miamioh.edu.